ABSTRACT

We propose a generalization of the SimRank similarity measure
for heterogeneous information networks. Given an information
network, the intraclass similarity score s(a, b) is high if
the set of objects related to a and the set of objects
related to b are pairwise similar according
to all imposed relations.

Categories and Subject Descriptors 

[Information systems]: Retrieval models and ranking - Similarity measures

General Terms 

SimRank, Probabilistic SVD, Tensor, Low-rank approximation

1. INTRODUCTION 

Most data in the modern world can be treated as an information
network, so measuring the similarity of network nodes has a
wide range of applications: search, recommendation systems,
analysis of research publication networks, biology,
transportation and logistics, and others.

Consider a semantic network: a set of types $T$, where each type
$t \in T$ is a set of entities, and a set of relations $R$, where each
relation is a two-place predicate defined on two types from $T$:

$R \ni r_{tp} : t \times p \to \{1, 0\}, \quad t, p \in T.$

Both types in a relation can be equal ($r_{tt} : t \times t \to \{0, 1\}$), and
several relations can share the same pair of types
($r^{(1)}_{tp}, r^{(2)}_{tp} \in \{0, 1\}^{t \times p}$). This structure may be considered as a graph
with colored vertices and colored edges: the color of a vertex is its
entity type, and the color of an edge corresponds to a relation.

The question that we address is how to define similarity
functions

$s_t : t \times t \to \mathbb{R}, \quad \forall t \in T,$
A Type of: devised structured activity

Instance of: candidate KB completeness node, clarifying
collection type, type of object, type of temporally stuff-like
thing

Subtypes: board game, brand name game, card game, 
child’s game, coin-operated game, dice game, electronic 
game, fantasy sports, game for two or more people, game 
of chance, guessing game, memory game, non-competitive 
game, non-team game, outdoor game, party game, puzzle 
game, role-playing game, sports game, table game, trivia 
game, word game 

Instances: ducks and drakes, ultimate frisbee, darts,
pachinko, Crossword Puzzle Activity CW, pool, Snooker,
mini golf


Figure 1: OpenCyc ontology node of the concept "Game"


that would reflect the closeness of objects based on the "similarity
of relations" they enter, while not mixing
different relations, since "objects of different types
and links carry different semantic meanings, and it does not
make sense to mix them to measure the similarity without
distinguishing their semantics".

1.1 Related work 

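The basic graph structure similarity measure is the classical
SimRank over a homogeneous graph $G = (V, E)$,
which is defined as follows:

$N(a) = \{v \in V : (v, a) \in E(G)\},$

$s(a, b) = \frac{C}{|N(a)|\,|N(b)|} \sum_{v \in N(a)} \sum_{w \in N(b)} s(v, w), \quad s(a, a) = 1,$

where $C$ is a constant decay factor.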

The main drawback of this approach is that we cannot introduce
multiple relations or object types, so the only option
is to merge them into the blobs "relation exists" and "all
objects". This is not applicable when we have
multiple relations with different semantics. For example, the
OpenCyc ontology node of the concept "Game" (see Figure 1)
cannot be easily expressed via a single type of relations
and objects.
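For illustration, the classical SimRank recursion can be sketched in NumPy as follows. This is a minimal sketch rather than a reference implementation: the function name `simrank`, the decay factor `c`, the fixed iteration count, and the convention that `adj[v, a] = 1` encodes an edge from v to a are all assumptions made for the example.

```python
import numpy as np


def simrank(adj, c=0.8, n_iter=20):
    """Naive SimRank iteration; column a of adj lists the in-neighbours N(a)."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    deg = adj.sum(axis=0)                      # |N(a)| for every node a
    # Column-normalise; nodes without in-neighbours keep an all-zero column.
    w = np.divide(adj, deg, out=np.zeros((n, n)), where=deg > 0)
    s = np.eye(n)
    for _ in range(n_iter):
        s = c * (w.T @ s @ w)                  # average similarity over neighbour pairs
        np.fill_diagonal(s, 1.0)               # s(a, a) = 1 by definition
    return s


# Toy graph: node 0 points to nodes 1 and 2, so 1 and 2 share their only neighbour.
adj = np.array([[0, 1, 1],
                [0, 0, 0],
                [0, 0, 0]])
scores = simrank(adj)
```

In this toy graph every edge has the same meaning; with several edge types the single matrix `adj` would have to merge them, which is exactly the limitation discussed above.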

Personalized PageRank is also often used to measure
similarity in homogeneous graphs:

$\pi_a(b) = \epsilon\,\delta_a(b) + (1 - \epsilon) \sum_{(w, b) \in E} \frac{\pi_a(w)}{\deg(w)},$

where $\deg(w)$ is the out-degree of $w$. This is the same as PageRank, except that random jumps are made
into a pre-chosen node $a$ rather than into a random node.
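A minimal power-iteration sketch of Personalized PageRank is given here for illustration. The function name `personalized_pagerank`, the teleport probability `eps`, the iteration count, and the assumption that every node has at least one outgoing edge are choices made for this example only.

```python
import numpy as np


def personalized_pagerank(adj, source, eps=0.15, n_iter=100):
    """Power iteration for pi_source; adj[w, b] = 1 encodes an edge w -> b."""
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    p = adj / out_deg                  # row-normalised; assumes no zero out-degree
    delta = np.zeros(n)
    delta[source] = 1.0                # all random jumps return to the source node
    pi = delta.copy()
    for _ in range(n_iter):
        pi = eps * delta + (1.0 - eps) * (p.T @ pi)
    return pi


# Directed 3-cycle 0 -> 1 -> 2 -> 0, personalised to node 0.
cycle = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [1, 0, 0]])
pi = personalized_pagerank(cycle, source=0)
```

The resulting vector sums to one, and its entries decrease with the distance from the source along the cycle.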

Another option is PathRank, which measures path similarity
between objects $a, b$ picked from the same class $A$ of the
heterogeneous information network $\mathcal{N}$, given a symmetric
meta-path $\mathcal{P}$ (the set of paths that satisfy a composition of
relations $A \xrightarrow{M_1} C_1 \xrightarrow{M_2} C_2 \cdots \xrightarrow{M_l} A$,
so that $A \xrightarrow{\mathcal{P}} A$). It is defined as the number of paths from the
object $a$ to the object $b$ (each step $i$ must satisfy the corresponding
relation $M_i$ in $\mathcal{P}$), normalized by the number of paths from $a$
to $a$ plus the number of paths from $b$ to $b$ given $\mathcal{P}$:

$s_{\mathcal{P}}(a, b) = \frac{|\{p \in \mathcal{P} : a \leadsto b\}|}{|\{p \in \mathcal{P} : a \leadsto a\}| + |\{p \in \mathcal{P} : b \leadsto b\}|}.$

This approach can handle several relations and object types,
and it is very useful when we know the structure of relations
on which the similarity measure should be based. If instead we
want to "put our relations into a black box" that
finds a similarity capturing all network relations as a
whole, we might want to use something different. Recently,
an approach for building an optimal linear combination
of meta-paths has been proposed.

There are also several works on measuring similarity between
objects from different classes.
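The path-count similarity described above can be sketched for the shortest symmetric meta-path A → C → A, where path counts reduce to a matrix product. This is an illustrative sketch under stated assumptions: the function name `path_similarity` and the toy relation matrix are hypothetical, and the normalization follows the wording of the text (paths from a to a plus paths from b to b).

```python
import numpy as np


def path_similarity(rel):
    """rel[a, c] = 1 if object a of class A is related to object c of class C."""
    rel = np.asarray(rel, dtype=float)
    counts = rel @ rel.T                       # counts[a, b] = number of paths a -> C -> b
    self_paths = np.diag(counts)
    denom = self_paths[:, None] + self_paths[None, :]
    # Pairs of objects without any paths get similarity 0.
    return np.divide(counts, denom, out=np.zeros_like(counts), where=denom > 0)


# Three objects of class A and two objects of class C.
rel = np.array([[1, 1],
                [1, 0],
                [0, 1]])
sim = path_similarity(rel)
```

Objects 1 and 2 share no neighbour in C, so their similarity is zero, while object 0 is connected to both.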


2. TENSOR SIMRANK
2.1 Problem statement 

Let us consider a function $s_t(a, b)$ that assigns a similarity
score to two objects from the same class $t$ as follows: objects
$a, b \in t \in T$ are similar (the value $s_t(a, b)$ is high) if they relate
to objects which are similar too. This interdependence can
be expressed via the following definition:

$N_{r_{tp}}(a) = \{b \in p \mid r_{tp}(a, b) = 1\},$

$s_t(a, b) = \frac{1}{Z} \sum_{r_{tp} \in R} w(r_{tp}) \sum_{c \in N_{r_{tp}}(a)} \sum_{d \in N_{r_{tp}}(b)} s_p(c, d),$

where $r_{tp}$ is the relation between classes $t, p \in T$, $N_{r_{tp}}$ is the
neighbourhood function that returns the set of objects from the
class $p$ that are related to the object $a$ via the relation $r_{tp}$,
$w(r_{tp})$ are the weights corresponding to the relation $r_{tp}$, and $Z$
is the normalization constant.

This can be rewritten as a Tensor SimRank equation:

$S_{\alpha\beta} = \sum_{\gamma} w_{\gamma} \sum_{\alpha', \beta'} r_{\alpha\alpha'\gamma}\, S_{\alpha'\beta'}\, r_{\beta\beta'\gamma}, \quad (1)$

$S = \mathrm{diag}(\{s_t\}_{t \in T}), \quad S_{\alpha\alpha} = 1,$

where $S$ is a block-diagonal matrix (one block per entity
type), $w$ are the relation weights, and $r_{\alpha\beta\gamma}$ are the stochastic
relation tensors^1 (which have non-zero blocks where relations
exist).

^1 We have to use tensors instead of matrices to have multiple
relations on the same pair of classes.


Similarity scores between elements of different classes are
equal to zero by definition. Relations between objects of
unrelated classes are equal to zero by definition too.
Equation (1) is basically the classical SimRank equation with an
adjacency tensor instead of the adjacency matrix: each non-zero
layer of the tensor encodes some relation on the same pair
of types. If one has more than a single relation between
types $p, t \in T$, then $r$ has multiple non-zero layers
on the intersection of indices associated with the classes $t, p$:
one adjacency matrix per layer. In (1) the index $\gamma$ stands
for (weighted) summation over all layers of the tensor. This
can be equivalently rewritten explicitly:

$S = \sum_{\gamma} w_{\gamma} W_{\gamma} S W_{\gamma}^{T} + D, \quad (2)$

where the diagonal matrix $D$ has to be chosen in such a way
that $\mathrm{diag}(S) = I$.
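The fixed-point iteration for the layered equation above can be sketched as follows. This is a minimal sketch, not the authors' implementation: the names `row_normalise` and `tensor_simrank`, the row-wise normalization of each layer, the fixed iteration count, and the toy network (objects 0 and 1 of one type, objects 2 and 3 of another, two relations between them) are assumptions made for the example.

```python
import numpy as np


def row_normalise(a):
    """Scale every non-empty row to sum to one; empty rows stay zero."""
    a = np.asarray(a, dtype=float)
    deg = a.sum(axis=1, keepdims=True)
    return np.divide(a, deg, out=np.zeros_like(a), where=deg > 0)


def tensor_simrank(layers, weights, n_iter=50):
    """Iterate S <- sum_g w_g W_g S W_g^T, then reset the diagonal to one."""
    n = layers[0].shape[0]
    s = np.eye(n)
    for _ in range(n_iter):
        s_new = sum(w * (m @ s @ m.T) for w, m in zip(weights, layers))
        np.fill_diagonal(s_new, 1.0)   # equivalent to adding D = I - diag(s_new)
        s = s_new
    return s


# Layer 1: edges 0-2, 1-2, 1-3. Layer 2: edges 0-3, 1-3.
a1 = np.zeros((4, 4))
a1[[0, 1, 1], [2, 2, 3]] = 1
a2 = np.zeros((4, 4))
a2[[0, 1], [3, 3]] = 1
layers = [row_normalise(a + a.T) for a in (a1, a2)]
s = tensor_simrank(layers, weights=[0.5, 0.5])
```

Because every layer only connects objects of different types, the iterate stays block-diagonal: the similarities between objects of different types remain exactly zero.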

2.2 Computational algorithm 

Simple fixed-point iterations for the Tensor SimRank equation are
computationally demanding due to large-scale matrix-by-matrix products, thus we
propose a method that exploits the fact that $S$ is block-diagonal
and $r$ is a three-dimensional block tensor whose
last dimension (the number of layers) is much smaller than the overall
number of objects. On each iteration $k$, for each $r \in R$ we
recompute the $s_i$ updates independently (assuming all other $s_j$
fixed), see Algorithm 1.


Algorithm 1: Idea behind Tensor SimRank
Data: T - classes, R - relations
Result: S = {s_t(a, b)}_{t ∈ T}

repeat
    for s_t ∈ S do
        assume all S \ s_t fixed
        for r ∈ R : r_tp : t × p → {1, 0} do
            for (a, b) ∈ t × t do
                for (c, d) ∈ p × p do
                    s_t(a, b) += r_tp(a, c) s_p(c, d) r_tp(b, d)
                    (equivalently, += r_tp(a, c) s_p(c, d) r_pt(d, b))
                end
            end
        end
    end
    update all s_t
until ||S^(k) - S^(k-1)|| < δ;


So we just update the similarity scores for each class, assuming
the similarities of all other classes are fixed, in such a way that
objects $a, b$ from the target class $t$ that are related to close
objects $c, d \in p$ from some other class (i.e. $s_p(c, d)$
is high) become closer too ($s_t(a, b)$ increases).

To present the actual vectorized algorithm for similarity computation,
let us introduce some additional notation: the set of
entity types $T = \{t_i\}$, where each entity type $t$ is a set of entities;
the set of symmetric relation functions $R = \{r^{(j)}_{tp}\}$,
where $r^{(j)}_{tp} : t \times p \to \{0, 1\}$, $t, p \in T$, and $j$ is the order; the
column-stochastic matrix of pairwise type impacts (weights) $w \in \mathbb{R}^{|T| \times |T|}$;
and the operator $W : r_{tp} \mapsto \mathbb{R}^{|t| \times |p|}$, which maps a relation
into the corresponding column-stochastic adjacency matrix. If
$r_{tp}$ is not defined for some $(t, p) \in T^2$, then $w_{tp} = 0$.


Algorithm 2: Vectorised Tensor SimRank for HSM
Data: T - classes, R - relations, w - relation weights
Result: S = {s_t(a, b)}_{t ∈ T}

for t ∈ T do
    s_t^(0) = I
end
k = 0
repeat
    for t ∈ T do
        s_t^new ← 0
        for R ∋ r : t × p → {1, 0} do
            s_t^new ← s_t^new + w_tp W(r_tp) s_p^(k) W(r_pt)
        end
    end
    for t ∈ T do
        s_t^(k+1) ← s_t^new - diag(s_t^new) + I
    end
    k = k + 1
until convergence
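The per-type update of Algorithm 2 can be sketched in NumPy as follows. This is a hedged sketch rather than the authors' code: the names `row_normalise` and `blockwise_simrank`, the dictionary layout of sizes, relations and weights, and the use of a row-normalised matrix and its transpose in place of the pair W(r_tp), W(r_pt) are assumptions made for the example.

```python
import numpy as np


def row_normalise(r):
    """Scale every non-empty row to sum to one; empty rows stay zero."""
    r = np.asarray(r, dtype=float)
    deg = r.sum(axis=1, keepdims=True)
    return np.divide(r, deg, out=np.zeros_like(r), where=deg > 0)


def blockwise_simrank(sizes, relations, weights, n_iter=30):
    """sizes: {type: n}; relations: {(t, p): 0/1 matrix of shape (n_t, n_p)}."""
    s = {t: np.eye(n) for t, n in sizes.items()}
    for _ in range(n_iter):
        new = {t: np.zeros((n, n)) for t, n in sizes.items()}
        for (t, p), r in relations.items():
            w = row_normalise(r)               # stands in for W(r_tp)
            new[t] += weights[(t, p)] * (w @ s[p] @ w.T)
        for t in new:
            np.fill_diagonal(new[t], 1.0)      # s_t <- s_new - diag(s_new) + I
        s = new
    return s


# Books 0 and 1 share author 0; book 2 is written by author 1.
wrote = np.array([[1, 0],
                  [1, 0],
                  [0, 1]])
sizes = {"book": 3, "author": 2}
relations = {("book", "author"): wrote, ("author", "book"): wrote.T}
weights = {("book", "author"): 1.0, ("author", "book"): 1.0}
s = blockwise_simrank(sizes, relations, weights)
```

Only the small per-type blocks are ever multiplied, which is the point of exploiting the block-diagonal structure of the similarity matrix.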


To achieve better results on sparse relations (see Section 3.2),
we adopted the Low-Rank SimRank approximation [11], which
uses Probabilistic Singular Value Decomposition [12] to perform
fast approximate projections onto the low-rank matrix manifold
at each step of the iterative process (Algorithm 3).

The only difference from Algorithm 2 is that on each step
we perform a probabilistic SVD of the matrix
$S - I$, so that $S \approx I + U D U^T$, and project it onto the
manifold of matrices of rank $a_t$.


Algorithm 3: Low-rank Tensor SimRank for HSM
Data: T - classes, R - relations, w - relation weights,
{a_t} - approximation ranks
Result: S = {s_t(a, b)}_{t ∈ T}

for t ∈ T do
    s_t^(0) = I
end
k = 0
u_t = 0
d_t = 0
repeat
    for t ∈ T do
        s_t^new ← 0
        for R ∋ r : t × p → {1, 0} do
            s_t^new = s_t^new + w_tp (W(r_tp) W(r_pt) + W(r_tp) u_p d_p u_p^T W(r_pt))
        end
        k = k + 1
    end
    for t ∈ T do

^new _ ^new rji 

at — Jt — 

        u_t, d_t = ProbabilisticSVD(s_t^new, a_t)
        s_t^(k+1) ← s_t^new + I
    end
until err < ε;
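The low-rank update can be illustrated for a single type with the sketch below. It is not the authors' implementation: a basic randomised range finder followed by a small symmetric eigendecomposition stands in for the probabilistic SVD of [12], and the names `randomized_eig` and `low_rank_step`, the oversampling parameter, and the choice to zero the diagonal before factorising are assumptions of this example.

```python
import numpy as np


def randomized_eig(m, rank, n_oversample=5, seed=0):
    """Randomised factorisation m ~ u @ diag(d) @ u.T for a symmetric matrix m."""
    rng = np.random.default_rng(seed)
    n = m.shape[0]
    k = min(n, rank + n_oversample)
    # Orthonormal basis for the range of m, sampled with a Gaussian test matrix.
    q, _ = np.linalg.qr(m @ rng.standard_normal((n, k)))
    small = q.T @ m @ q                        # k x k projected problem
    d, v = np.linalg.eigh(small)
    keep = np.argsort(-np.abs(d))[:rank]       # largest-magnitude eigenvalues
    return q @ v[:, keep], d[keep]


def low_rank_step(w, u, d, rank):
    """One update W (I + U D U^T) W^T, refactorised without storing S explicitly."""
    wu = w @ u
    s_new = w @ w.T + (wu * d) @ wu.T          # (wu * d) scales column j by d[j]
    np.fill_diagonal(s_new, 0.0)               # keep the off-diagonal part, S - I
    return randomized_eig(s_new, rank)


# Start from S = I, i.e. u = 0 and d = 0, with a random row-stochastic W.
rng = np.random.default_rng(1)
w = rng.random((5, 5))
w /= w.sum(axis=1, keepdims=True)
u, d = low_rank_step(w, np.zeros((5, 2)), np.zeros(2), rank=2)
```

Each step only needs the factors `u` and `d` of the previous iterate, so the dense similarity block has to be formed only transiently (and could be avoided entirely with matrix-free products).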


2.3 Convergence conditions

Recall that the classical SimRank can be computed as a
solution of the equation:

$S := W S W^T - \mathrm{diag}(W S W^T) + I.$

Fixed-point iteration converges if $W$ is a column-stochastic
matrix. In the vector form (the $\mathrm{vec}(\cdot)$ operator maps an $n \times n$
matrix into an $n^2$ vector by taking column by column) this
can be written as^2

$[W \otimes W - I]\,\mathrm{vec}(S) - \mathrm{vec}(\mathrm{diag}(W S W^T)) + \mathrm{vec}(I) = 0;$

if the matrix $W$ is stochastic, then $W \otimes W$ is stochastic too.

^2 $\mathrm{vec}(ABC) = (C^T \otimes A)\,\mathrm{vec}(B)$

The Tensor SimRank computation (2) can be equivalently written
in the form:

$S := \sum_{\gamma} w_{\gamma} W_{\gamma} S W_{\gamma}^T - \mathrm{diag}\Big(\sum_{\gamma} w_{\gamma} W_{\gamma} S W_{\gamma}^T\Big) + I, \quad (3)$

or in the vectorized form

$\Big[\sum_{\gamma} w_{\gamma} W_{\gamma} \otimes W_{\gamma} - I\Big]\,\mathrm{vec}(S) - \mathrm{vec}(\mathrm{diag}(\cdots)) + \mathrm{vec}(I) = 0.$

Moreover, SimRank is also commonly approximated by the
solution of the discrete Lyapunov equation:

$S = c W S W^T + (1 - c) I,$

which can be generalized to the tensor case as

$S = c \sum_{\gamma} w_{\gamma} W_{\gamma} S W_{\gamma}^T + (1 - c) I.$

We conjecture that fixed-point iterations for (3) converge if:
1. Each $W_{\gamma}$ is stochastic
2. $\sum_{\gamma} w_{\gamma} = 1$
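The role of the two conditions can be checked numerically with the short sketch below: when every layer is column-stochastic and the weights sum to one, the operator acting on vec(S) is column-stochastic as well. The random matrices, the weights and all variable names are illustrative assumptions; the check does not prove the conjecture.

```python
import numpy as np

rng = np.random.default_rng(0)


def random_column_stochastic(n):
    """Random non-negative matrix whose columns each sum to one."""
    m = rng.random((n, n))
    return m / m.sum(axis=0, keepdims=True)


n = 4
layers = [random_column_stochastic(n) for _ in range(3)]
weights = np.array([0.2, 0.3, 0.5])            # non-negative, sums to one

# Operator acting on vec(S): sum_g w_g (W_g kron W_g).
op = sum(w * np.kron(m, m) for w, m in zip(weights, layers))
column_sums = op.sum(axis=0)

# Identity vec(W S W^T) = (W kron W) vec(S) with column-major vec.
s = rng.random((n, n))
lhs = (layers[0] @ s @ layers[0].T).flatten(order="F")
rhs = np.kron(layers[0], layers[0]) @ s.flatten(order="F")
```

The column sums of a Kronecker product are products of the column sums of its factors, which is why the convex combination stays stochastic.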

In the simplest form (we have no preferences among relations 
and classes) it reduces to (relations weight): 


'^tp — 


1 


3. COMPUTATIONAL EXPERIMENT
3.1 Synthetic data: convergence test 

To test the convergence conditions we conducted a series of tests
on randomly generated sparse networks with different numbers
of classes, $K \in \{3, 5, 7, 10\}$, a randomly chosen
number of objects in each class, $N_{real} \sim U[N/2, N]$, $N \in \{10 \ldots 100\}$,
a full network of relation types (all possible types of relations
exist) with randomly chosen edges in each,
and the default $w$ matrix (no priority). All generated networks
successfully converged, which illustrates that the sufficient
convergence conditions listed in the previous section were adequate,
see Figures 2 and 3.



















Figure 2: Average time spent on 10 iterations of the
algorithm on a random network with K components and
N objects in each

Figure 3: Mean Frobenius residual after 10 iterations
of the algorithm as a function of the number of objects
(N), K components

3.2 Synthetic data: similarity reconstruction 

To determine whether the model is capable of similarity reconstruction,
we generated a tree graph from randomly distributed
points on a plane and tested whether the model can reconstruct the
spatial similarity of the points based only on their relations.

In Figure 4, blue points represent 0-level points, which are
connected to 1-level points (red), which are in turn connected to
2-level points (green).

We measured the following quality of the reconstructed similarity
$\hat{S}$ compared to the real $S$ obtained from the generated point
coordinates:

$Q(\hat{S}, S) = \sum_{a, b, c} [\hat{S}_{ab} < \hat{S}_{ac} \text{ and } S_{ab} < S_{ac}],$

which shows how many "a is closer to c than to b"
relations were preserved.
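The ordering-preservation measure can be sketched as follows. The function name `ordering_quality` and, in particular, the normalization by the number of strict orderings present in the true matrix are assumptions of this example; the text only states that preserved orderings are counted.

```python
import numpy as np


def ordering_quality(s_hat, s_true):
    """Fraction of strict orderings S_ab < S_ac that also hold in s_hat."""
    # Entry [a, b, c] of each boolean array tests S[a, b] < S[a, c].
    true_lt = s_true[:, :, None] < s_true[:, None, :]
    hat_lt = s_hat[:, :, None] < s_hat[:, None, :]
    total = true_lt.sum()
    return (true_lt & hat_lt).sum() / total if total else 0.0


s_true = np.array([[1.0, 0.5, 0.2],
                   [0.5, 1.0, 0.3],
                   [0.2, 0.3, 1.0]])
q_same = ordering_quality(s_true, s_true)       # identical orderings
q_flipped = ordering_quality(-s_true, s_true)   # every ordering reversed
```

With this normalization a perfect reconstruction scores one and a fully reversed one scores zero, which makes values comparable across networks of different sizes.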

From Figure 5 one can see that at the level $r \approx 0.3$ the model
gets saturated, but at the level $r \approx 0.15$ the models that use
the low-rank version of Tensor SimRank perform much better
than the "pure" algorithm. The numbers in the brackets



Figure 4: Random points for graph generation: blue
points are the zero level, red points are the first level, green
points are the second level


Algorithm 4: Graph generation algorithm

Data: N - number of layers, {n_1, ..., n_N} - number of
dots on each layer, r - connection radius
Result: T, R
for k ∈ {1..N} do
    p^(k) ← generate n_k points from U[0,1]^2
    if k > 1 then
        connect p_i^(k) and p_j^(k-1) if ρ(p_i^(k), p_j^(k-1)) < r
    end
end


denote the dimensionality of the matrix space into which 
the similarity matrices were projected on each step (rank of 
approximation). 
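The layered graph generation of Algorithm 4 can be sketched as follows. This is an illustrative sketch: the function name `generate_layered_graph`, the unit square as sampling domain, the Euclidean distance, and the seed parameter are assumptions made for the example.

```python
import numpy as np


def generate_layered_graph(layer_sizes, radius, seed=0):
    """Sample points per layer and connect consecutive layers within `radius`."""
    rng = np.random.default_rng(seed)
    points = [rng.random((n, 2)) for n in layer_sizes]
    relations = []
    for upper, lower in zip(points[1:], points[:-1]):
        # Pairwise Euclidean distances between layer k and layer k - 1.
        dist = np.linalg.norm(upper[:, None, :] - lower[None, :, :], axis=-1)
        relations.append((dist < radius).astype(int))
    return points, relations


points, relations = generate_layered_graph([5, 8, 12], radius=0.3)
```

Each returned matrix is the 0/1 relation between one layer and the previous one, so the layers play the role of types and the matrices the role of relations.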

3.3 Book-Crossing Dataset test 

The model was run on a subsample of the Book-Crossing
Dataset. We extracted only those authors who had the
highest (top 100) number of books in the collection. The final
network had the following structure:

T = {Book, Author, Year, Publisher}
R = {isAuthorOf(·, ·), publishedBy(·, ·), publishedIn(·, ·)}
#Book = 3625, #Author = 99, #Year = 65, #Publisher = 554

Model convergence is shown in Figure 6, where successful
convergence to the best possible low-rank approximation
can be seen. The similarity structure is clearly visible
on the Year similarity matrix heatmap (Figure 7). We expect
diagonal dominance, since temporally close years
should be more or less similar in terms of the authors and
publishers characteristic of that period. Tables 1 and 2 are
examples of "closest book" requests. Note that
no NLP preprocessing was conducted; nevertheless, the model
treated books from the same series as similar based on
author/publisher/year similarities.

4. DISCUSSION AND FURTHER WORK 


























Figure 5: The value of $Q(\hat{S}, S)$ as a function of r



Figure 6: Monotonic reduction in the residual 
Figure 7: Year similarity matrix


Table 1: Books closest to "Psychic Sisters"
Psychic Sisters (Sweet Valley Twins and Friends, No 70)
The Love Potion (Sweet Valley Twins and Friends, No 72)
The Curse of the Ruby Necklace (Sweet Valley Twins and Friends Super, No 5)
She's Not What She Seems (Sweet Valley High No. 92)
Are We in Love? (Sweet Valley High, No 94)
Don't Go Home With John (Sweet Valley High No. 90)
In Love With a Prince (Sweet Valley High, No 91)


Table 2: Books closest to "The Girl Who Loved Tom Gordon"
The Girl Who Loved Tom Gordon
Hearts In Atlantis (All You Want to Know)
Blood And Smoke
Blood And Smoke Cd
Atlantis.
The Body (Penguin Readers: Level 5)
Storm of the Century

The proposed model can be used in various problem areas
where most of the information is available in the form of
relations between entities rather than features of individual
entities, and where no trivial vector representation of those
entities can be induced. One can use the vector representation

$[s_t]_{ij} = \delta_{ij} + [u_t][d_t][u_t]^T_{ij}$

to embed the notion of relations into classical machine learning
algorithms. Also, the proposed model can be used for
relation generalisation, which might give interesting results
since we work on heterogeneous graphs.
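One possible way to turn the low-rank factors into feature vectors is sketched below. It is an assumption of this example, not a procedure stated in the text: each object is represented by a row of U scaled by the square roots of the retained eigenvalues, and negative eigenvalues are clipped to zero. The function name `embed` is hypothetical.

```python
import numpy as np


def embed(u, d):
    """Rows x_i satisfy x_i . x_j = [u diag(d) u^T]_ij when all d are non-negative."""
    d = np.clip(d, 0.0, None)          # assumption: negative eigenvalues are dropped
    return u * np.sqrt(d)              # scales column j of u by sqrt(d[j])


rng = np.random.default_rng(0)
u, _ = np.linalg.qr(rng.standard_normal((6, 2)))   # orthonormal columns
d = np.array([2.0, 1.0])
x = embed(u, d)
```

The resulting rows can be passed to any feature-based learner, with inner products reproducing the off-diagonal part of the approximated similarity.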

Further model improvements might also include treating
relations as objects too (probably via heterogeneous
hypergraphs) and defining a similarity matrix on relations.

5. CONCLUSION 

This paper proposes a generalization of SimRank for
heterogeneous networks and a method for its computation
that exploits the fact that the resulting similarity matrix is
block-diagonal, so that its components can be computed in
an iterative fashion. Convergence conditions are proposed
and successfully tested. A few promising application
areas are suggested.